155 research outputs found
Decreasing log data of multi-tier services for effective request tracing
Previous work shows that request tracing systems help understand and debug the
performance problems of multi-tier services. However, in large-scale data
centers, hundreds of thousands of service instances provide online service at
the same time, and previous white-box or black-box tracing systems produce a
large amount of log data, which is correlated into large quantities of causal
paths for performance debugging. In this paper, we propose an innovative
algorithm to eliminate valueless logs of multi-tier services. Our experiments
show that our method filters out 84% of valueless causal paths and is promising
for use in large-scale data centers.
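The abstract does not spell out the filtering criterion, so the following minimal Python sketch is only a hypothetical illustration of the kind of reduction described: group causal paths by their pattern, keep one representative per pattern, and keep any path whose latency deviates strongly from its pattern's distribution.

```python
from collections import defaultdict
from statistics import mean, stdev

def filter_causal_paths(paths, keep_sigma=3.0):
    """Drop 'valueless' causal paths (hypothetical heuristic, not the paper's
    algorithm). For each path pattern (the sequence of tiers traversed), keep
    one representative plus any path whose end-to-end latency deviates strongly
    from that pattern's distribution.
    `paths` is a list of dicts: {"pattern": tuple_of_tiers, "latency_ms": float}."""
    by_pattern = defaultdict(list)
    for p in paths:
        by_pattern[p["pattern"]].append(p)

    kept = []
    for pattern, group in by_pattern.items():
        latencies = [p["latency_ms"] for p in group]
        mu = mean(latencies)
        sigma = stdev(latencies) if len(latencies) > 1 else 0.0
        kept.append(group[0])  # one representative per pattern
        kept.extend(p for p in group[1:]
                    if sigma and abs(p["latency_ms"] - mu) > keep_sigma * sigma)
    return kept
```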
Characterization and Architectural Implications of Big Data Workloads
Big data areas are expanding rapidly in terms of workloads and runtime systems,
and this situation poses a serious challenge to workload characterization,
which is the foundation of innovative system and architecture design. Previous
major efforts on big data benchmarking either propose a comprehensive but very
large set of workloads, or select only a few workloads according to so-called
popularity, which may lead to partial or even biased observations. In this
paper, on the basis of a comprehensive big data benchmark
suite---BigDataBench, we reduce 77 workloads to 17 representative workloads
from a micro-architectural perspective. On a typical state-of-practice
platform---Intel Xeon E5645, we compare the representative big data workloads
with SPECINT, SPECFP, PARSEC, CloudSuite and HPCC. After a comprehensive
workload characterization, we make the following observations. First, the big
data workloads are data-movement-dominated computing with more branch
operations, accounting for up to 92% of the instruction mix, which places them
in a different class from desktop (SPEC CPU2006), CMP (PARSEC), and HPC (HPCC)
workloads. Second, corroborating previous work, Hadoop- and Spark-based big
data workloads have higher front-end stalls. Compared with traditional
workloads such as PARSEC, the big data workloads have a larger instruction
footprint. But we also note that, in addition to varied instruction-level
parallelism, there are significant disparities in front-end efficiency among
different big data workloads. Third, we find that complex software stacks that
fail to use state-of-practice processors efficiently are one of the main
factors leading to high front-end stalls. For the same workloads, the L1I cache
miss rates differ by an order of magnitude among implementations with different
software stacks.
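For readers who want to reproduce this style of characterization, the reported numbers are simple ratios over hardware performance-counter totals (e.g. collected with `perf stat`). A minimal sketch follows; the counter names are illustrative placeholders, not the exact events used in the paper.

```python
def characterize(counters):
    """Compute the derived micro-architectural metrics discussed above from raw
    hardware-counter totals. `counters` maps event names to counts; the names
    here are illustrative placeholders, not the paper's exact events."""
    instructions = counters["instructions"]
    cycles = counters["cycles"]
    return {
        # fraction of the instruction mix that is loads, stores and branches
        # ("data-movement-dominated computing with more branch operations")
        "data_movement_and_branch_ratio":
            (counters["loads"] + counters["stores"] + counters["branches"])
            / instructions,
        # front-end stall ratio: cycles in which the front end supplied no uops
        "frontend_stall_ratio": counters["frontend_stall_cycles"] / cycles,
        # L1 instruction-cache misses per kilo-instructions (L1I MPKI)
        "l1i_mpki": counters["l1i_misses"] / instructions * 1000,
        "ipc": instructions / cycles,
    }

print(characterize({
    "instructions": 1.0e12, "cycles": 1.4e12,
    "loads": 3.0e11, "stores": 1.5e11, "branches": 2.0e11,
    "frontend_stall_cycles": 4.2e11, "l1i_misses": 8.0e9,
}))
```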
Automatic Performance Debugging of SPMD-style Parallel Programs
The single program, multiple data (SPMD) programming model is widely used
for both high performance computing and Cloud computing. In this paper, we
design and implement an innovative system, AutoAnalyzer, that automates the
process of debugging performance problems of SPMD-style parallel programs,
including data collection, performance behavior analysis, locating bottlenecks,
and uncovering their root causes. AutoAnalyzer is unique in terms of two
features: first, without any a priori knowledge, it automatically locates
bottlenecks and uncovers their root causes for performance optimization;
second, it is lightweight in terms of the size of performance data to be
collected and analyzed. Our contributions are three-fold: first, we propose two
effective clustering algorithms to investigate the existence of performance
bottlenecks that cause process behavior dissimilarity or code region behavior
disparity, respectively; meanwhile, we present two searching algorithms to
locate bottlenecks; second, on the basis of rough set theory, we propose an
innovative approach to automatically uncovering root causes of bottlenecks;
third, on the cluster systems with two different configurations, we use two
production applications, written in Fortran 77, and one open source
code-MPIBZIP2 (http://compression.ca/mpibzip2/), written in C++, to verify the
effectiveness and correctness of our methods. For the three applications, we
also propose an experimental approach to investigating the effects of different
metrics on locating bottlenecks.
Comment: 16 pages, 23 figures. Accepted by the Journal of Parallel and Distributed Computing (JPDC).
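The clustering and searching algorithms are detailed in the paper itself; the short sketch below only illustrates the general idea of detecting process-behavior dissimilarity by clustering per-process metric vectors (a generic k-means here, not necessarily AutoAnalyzer's algorithms).

```python
import numpy as np
from sklearn.cluster import KMeans

def find_dissimilar_processes(metrics, n_clusters=2):
    """`metrics` is an (n_processes, n_metrics) array: one row per MPI process,
    with columns such as per-code-region wall time, communication time, or
    cache misses. In a well-balanced SPMD run all processes should fall into
    one cluster; a small outlying cluster points at processes worth inspecting."""
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(metrics)
    counts = np.bincount(labels)
    minority = counts.argmin()  # smallest cluster = suspicious processes
    return np.where(labels == minority)[0], labels
```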
Automatic Performance Debugging of SPMD Parallel Programs
Automatic performance debugging of parallel applications usually involves two
steps: automatic detection of performance bottlenecks and uncovering their root
causes for performance optimization. Previous work fails to resolve this
challenging issue in several ways: first, several previous efforts automate
analysis processes, but present the results in a confined way that only
identifies performance problems with a priori knowledge; second, several tools
use exploratory or confirmatory data analysis to automatically discover
relevant performance data relationships. However, these efforts do not focus on
locating performance bottlenecks or uncovering their root causes. In this
paper, we design and implement an innovative system, AutoAnalyzer, to
automatically debug the performance problems of single program, multiple data
(SPMD) parallel programs. Our system is unique in terms of two dimensions:
first, without any a priori knowledge, we automatically locate bottlenecks and
uncover their root causes for performance optimization; second, our method is
lightweight in terms of the size of the collected and analyzed performance data. Our
contribution is three-fold. First, we propose a set of simple performance
metrics to represent behavior of different processes of parallel programs, and
present two effective clustering and searching algorithms to locate
bottlenecks. Second, we propose to use the rough set algorithm to automatically
uncover the root causes of bottlenecks. Third, we design and implement the
AutoAnalyzer system, and use two production applications to verify the
effectiveness and correctness of our methods. According to the analysis results
of AutoAnalyzer, we optimize two parallel programs with performance
improvements ranging from 20% to 170%.
Comment: The preliminary version appeared at the SC 08 Workshop on Node Level Parallelism for Large Scale Supercomputers. The web site is
http://iss.ices.utexas.edu/sc08nlplss/program.htm
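As a loose illustration of how rough set theory can tie bottleneck classes to the metrics responsible for them, the sketch below computes the standard rough-set dependency degree and per-attribute significance over a small, made-up decision table; this is not AutoAnalyzer's exact procedure.

```python
from collections import defaultdict

def dependency_degree(rows, cond_attrs, dec_attr):
    """Rough-set dependency degree gamma(C, D): the fraction of rows whose
    decision value is fully determined by their condition-attribute values."""
    blocks = defaultdict(set)
    for i, row in enumerate(rows):
        blocks[tuple(row[a] for a in cond_attrs)].add(i)
    positive = sum(len(ids) for ids in blocks.values()
                   if len({rows[i][dec_attr] for i in ids}) == 1)
    return positive / len(rows)

def attribute_significance(rows, cond_attrs, dec_attr):
    """Significance of each condition attribute: how much the dependency degree
    drops when that attribute is removed. Large drops mark the metrics that
    'explain' the bottleneck class."""
    full = dependency_degree(rows, cond_attrs, dec_attr)
    return {a: full - dependency_degree(rows, [b for b in cond_attrs if b != a], dec_attr)
            for a in cond_attrs}

# Made-up discretized records: one per code region, decision = bottleneck class.
rows = [
    {"cache_miss": "high", "comm_ratio": "low",  "bottleneck": "memory"},
    {"cache_miss": "high", "comm_ratio": "high", "bottleneck": "communication"},
    {"cache_miss": "low",  "comm_ratio": "high", "bottleneck": "communication"},
    {"cache_miss": "low",  "comm_ratio": "low",  "bottleneck": "none"},
]
print(attribute_significance(rows, ["cache_miss", "comm_ratio"], "bottleneck"))
# -> comm_ratio has the higher significance for this toy table
```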
AccuracyTrader: Accuracy-aware Approximate Processing for Low Tail Latency and High Result Accuracy in Cloud Online Services
Modern latency-critical online services such as search engines often process
requests by consulting large input data spanning massive parallel components.
Hence the tail latency of these components determines the service latency. To
trade off result accuracy for tail latency reduction, existing techniques use
the components responding before a specified deadline to produce approximate
results. However, they may skip a large proportion of components when load gets
heavier, thus incurring large accuracy losses. This paper presents
AccuracyTrader that produces approximate results with small accuracy losses
while maintaining low tail latency. AccuracyTrader aggregates information of
input data on each component to create a small synopsis, thus enabling all
components to produce initial results quickly using their synopses.
AccuracyTrader also uses the synopses to identify the parts of the input data
most relevant to a request's result accuracy, and uses these parts first to
improve the produced results in order to minimize accuracy losses. We evaluated
AccuracyTrader using workloads in real services. The results show: (i)
AccuracyTrader reduces tail latency by over 40 times with accuracy losses of
less than 7% compared to existing exact processing techniques; (ii) at the
same latency, AccuracyTrader reduces accuracy losses by over 13 times compared
to existing approximate processing techniques.
Comment: 10 pages, 8 figures, 2 tables
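As a rough, hypothetical illustration of the two-phase idea (not the paper's implementation): each component keeps a small synopsis of its data shard, answers immediately from the synopsis ranking, and then refines the result using the most relevant parts of the shard first, until a deadline.

```python
import time

class Component:
    """One parallel component holding a shard of documents. The 'synopsis' here
    is simply a per-bucket keyword set; the real system's synopsis design is
    more elaborate."""
    def __init__(self, shard, bucket_size=100):
        self.buckets = [shard[i:i + bucket_size]
                        for i in range(0, len(shard), bucket_size)]
        self.synopsis = [set(w for doc in bucket for w in doc.split())
                         for bucket in self.buckets]

    def search(self, query, deadline_s):
        start = time.monotonic()
        terms = set(query.split())
        # Phase 1: an initial, approximate answer straight from the synopsis
        # (bucket indices ranked by keyword overlap) -- always fast.
        ranked = sorted(range(len(self.buckets)),
                        key=lambda i: -len(self.synopsis[i] & terms))
        hits = []
        # Phase 2: refine by scanning the most relevant buckets first,
        # stopping when the deadline is reached.
        for i in ranked:
            if time.monotonic() - start > deadline_s:
                break
            hits.extend(doc for doc in self.buckets[i] if terms <= set(doc.split()))
        return hits

# usage: Component(["big data tracing", "dvfs power saving"]).search("big data", 0.01)
```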
PowerTracer: Tracing requests in multi-tier services to save cluster power consumption
As energy proportional computing gradually extends the success of DVFS
(Dynamic voltage and frequency scaling) to the entire system, DVFS control
algorithms will play a key role in reducing server clusters' power consumption.
The focus of this paper is to provide accurate cluster-level DVFS control for
power saving in a server cluster. To achieve this goal, we propose a request
tracing approach that online classifies the major causal path patterns of a
multi-tier service and monitors their performance data as a guide for accurate
DVFS control. The request tracing approach significantly decreases the time
cost of performance profiling experiments that aim to establish the empirical
performance model. Moreover, it decreases the controller complexity so that we
can introduce a much simpler feedback controller, which modulates the DVFS
setting of a single node at a time rather than varying multiple CPU frequencies
simultaneously. Based on the request tracing approach, we present a
hybrid DVFS control system that combines an empirical performance model for
fast modulation at different load levels and a simpler feedback controller for
adaption. We implement a prototype of the proposed system, called PowerTracer,
and conduct extensive experiments on a 3-tier platform. Our experimental
results show that PowerTracer outperforms its peer in terms of power saving and
system performance.
Comment: 10 pages, 22 figures
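The hybrid controller itself is described only at a high level; the sketch below shows the general shape of one single-node feedback step (an illustrative proportional controller, not PowerTracer's actual control law), choosing the next CPU frequency for one node from the measured latency of the dominant causal path pattern.

```python
def next_frequency(freq_hz, measured_latency_ms, target_latency_ms,
                   available_freqs_hz, gain=0.5):
    """One feedback step for single-node DVFS modulation: scale this node's
    frequency in proportion to the relative latency error, then snap to the
    nearest frequency the CPU actually supports.
    Illustrative proportional controller, not PowerTracer's control law."""
    error = (measured_latency_ms - target_latency_ms) / target_latency_ms
    desired = freq_hz * (1.0 + gain * error)  # run faster when too slow, slower when slack
    return min(available_freqs_hz, key=lambda f: abs(f - desired))

# e.g. latency 20% above target nudges a 1.8 GHz core toward the 2.0 GHz step
print(next_frequency(1.8e9, 120, 100, [1.2e9, 1.6e9, 1.8e9, 2.0e9, 2.4e9]))
```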
Anomaly Analysis for Co-located Datacenter Workloads in the Alibaba Cluster
In warehouse-scale cloud datacenters, co-locating online services and offline
batch jobs is an efficient approach to improving datacenter utilization. To
better facilitate the understanding of interactions among the co-located
workloads and their real-world operational demands, Alibaba recently released a
cluster usage and co-located workload dataset, which is the first publicly
available dataset with precise information about the category of each job. In this paper,
we perform a deep analysis on the released Alibaba workload dataset, from the
perspective of anomaly analysis and diagnosis. Through data preprocessing, node
similarity analysis based on Dynamic Time Warping (DTW), co-located workload
characteristics analysis, and anomaly analysis based on iForest, we reveal
several insights, including: (1) the performance discrepancy among machines in
Alibaba's production cluster is relatively large, because the distribution and
resource utilization of co-located workloads are not balanced; for instance,
the resource utilization (especially memory utilization) of batch jobs
fluctuates and is not as stable as that of online containers, since online
containers are long-running, memory-demanding jobs while most batch jobs are
short-lived; (2) based on the distribution of co-located workload instance
numbers, the machines can be classified into 8 workload distribution
categories, and most machine resource utilization curves follow similar
patterns within the same category; (3) in addition to system failures,
unreasonable scheduling and workload imbalance are the main causes of anomalies
in Alibaba's cluster.
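As a concrete, simplified illustration of the analysis pipeline named above (DTW for node similarity plus iForest for anomaly detection, using scikit-learn's IsolationForest as the iForest implementation; the feature set here is assumed, not the paper's):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

def dtw_distance(a, b):
    """Plain O(len(a)*len(b)) dynamic time warping distance between two
    utilization time series (e.g. per-machine CPU usage curves)."""
    n, m = len(a), len(b)
    d = np.full((n + 1, m + 1), np.inf)
    d[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            d[i, j] = cost + min(d[i - 1, j], d[i, j - 1], d[i - 1, j - 1])
    return d[n, m]

# Anomaly detection over per-machine feature vectors; the feature choice is
# illustrative (mean/std of CPU and memory utilization per machine).
features = np.array([
    [0.55, 0.05, 0.60, 0.04],
    [0.53, 0.06, 0.62, 0.05],
    [0.95, 0.30, 0.98, 0.25],   # an unbalanced, overloaded machine
    [0.54, 0.05, 0.61, 0.04],
])
labels = IsolationForest(contamination=0.25, random_state=0).fit_predict(features)
print(labels)   # -1 marks machines flagged as anomalous
```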
PhoenixCloud: Provisioning Resources for Heterogeneous Workloads in Cloud Computing
As more and more service providers choose Cloud platforms, which are provided
by third-party resource providers, resource providers need to provision
resources for heterogeneous workloads in different Cloud scenarios. Taking into
account the dramatic differences among heterogeneous workloads, can we
coordinately provision resources for them in Cloud computing? In this paper we
focus on this important issue, which few previous works have investigated. Our
contributions are threefold: (1) we propose coordinated resource provisioning
solutions for heterogeneous workloads in two typical Cloud scenarios: first, a
large organization operates a private Cloud for two heterogeneous workloads;
second, a large organization or two service providers running heterogeneous
workloads resort to a public Cloud; (2) we build an agile system, PhoenixCloud, that
enables a resource provider to create coordinated runtime environments on
demand for heterogeneous workloads when they are consolidated on a Cloud site;
and (3) we perform a comprehensive experimental evaluation. For two typical
heterogeneous workload traces, parallel batch jobs and Web services, our
experiments show that: a) in the private Cloud scenario, while keeping the
throughput almost the same as that of a dedicated cluster system, our solution
decreases the configuration size of a cluster by about 40%; b) in the public
Cloud scenario, our solution decreases not only the total resource consumption
but also the peak resource consumption, by up to 31% with respect to the
EC2 + RightScale solution.
Comment: 18 pages. This is an extended version of our CCA 08 paper (The First
Workshop of Cloud Computing and its Application, CCA08, Chicago, 2008): J.
Zhan, L. Wang, B. Tu, Y. Li, P. Wang, W. Zhou, D. Meng. 2008. Phoenix Cloud:
Consolidating Different Computing Loads on Shared Cluster System for Large
Organization. The modified version can be found on
http://arxiv.org/abs/0906.134
HybridTune: Spatio-temporal Data and Model Driven Performance Diagnosis for Big Data Systems
With tremendous growing interest in Big Data systems, analyzing and
facilitating their performance improvement becomes increasingly important.
Although there have been many research efforts on improving the performance of
Big Data systems, efficiently analyzing and diagnosing performance bottlenecks
over these massively distributed systems remains a major challenge. In this
paper, we propose a spatio-temporal correlation analysis approach based on the
stage and distribution characteristics of Big Data applications, which
associates multi-level performance data at a fine granularity. On the basis of
the correlated data, we define a priori rules, select features, and vectorize
the corresponding datasets for different performance bottlenecks, such as
workload imbalance, data skew, abnormal nodes, and outlier metrics. We then
utilize data- and model-driven algorithms for bottleneck detection and
diagnosis. In addition, we design and develop a lightweight, extensible tool,
HybridTune, and validate its diagnosis effectiveness with BigDataBench on
several benchmark experiments, in which it outperforms state-of-the-art
methods. Our experiments show that the accuracy of our abnormal/outlier
detection reaches about 80%. Finally, we report several Spark and Hadoop use
cases that demonstrate how HybridTune supports users in carrying out
performance analysis and diagnosis efficiently on Spark and Hadoop
applications, and our experience shows that HybridTune can help users find
performance bottlenecks and provides optimization recommendations.
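The a priori rules are not listed in the abstract; the sketch below shows the general flavor of such rule-based detectors over per-task, per-stage metrics (the thresholds and rule forms are assumptions, not HybridTune's actual rules).

```python
import statistics

def detect_stage_bottlenecks(task_metrics, skew_ratio=2.0, straggler_ratio=1.5):
    """Rule-based checks over one stage's tasks. `task_metrics` is a list of
    dicts with per-task "duration_s", "input_bytes" and "host".
    The thresholds and rule forms are illustrative assumptions."""
    durations = [t["duration_s"] for t in task_metrics]
    inputs = [t["input_bytes"] for t in task_metrics]
    findings = []

    # Data skew: some task reads far more input than the median task.
    if max(inputs) > skew_ratio * statistics.median(inputs):
        findings.append("data skew")

    # Workload imbalance / stragglers: slowest task far exceeds the median duration.
    if max(durations) > straggler_ratio * statistics.median(durations):
        findings.append("workload imbalance")

    # Abnormal node: one host's tasks are consistently slower than the overall mean.
    by_host = {}
    for t in task_metrics:
        by_host.setdefault(t["host"], []).append(t["duration_s"])
    overall = statistics.mean(durations)
    for host, ds in by_host.items():
        if statistics.mean(ds) > straggler_ratio * overall:
            findings.append(f"abnormal node: {host}")
    return findings
```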
Comparison and Benchmarking of AI Models and Frameworks on Mobile Devices
Due to increasing amounts of data and compute resources, deep learning has
achieved many successes in various domains. The application of deep learning on
mobile and embedded devices is attracting more and more attention, and
benchmarking and ranking the AI abilities of these devices has become an urgent
problem to be solved. Considering model diversity and framework diversity,
we propose a benchmark suite, AIoTBench, which focuses on the evaluation of the
inference abilities of mobile and embedded devices. AIoTBench covers three
typical heavy-weight networks: ResNet50, InceptionV3, DenseNet121, as well as
three light-weight networks: SqueezeNet, MobileNetV2, MnasNet. Each network is
implemented by three frameworks which are designed for mobile and embedded
devices: Tensorflow Lite, Caffe2, Pytorch Mobile. To compare and rank the AI
capabilities of the devices, we propose two unified metrics as the AI scores:
Valid Images Per Second (VIPS) and Valid FLOPs Per Second (VOPS). Currently, we
have compared and ranked 5 mobile devices using our benchmark. This list will
be extended and updated soon.
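The abstract does not give the exact formulas for the two AI scores; a plausible reading (an assumption on our part, not the paper's definition) is that both weight raw throughput by result validity, i.e. by classification accuracy.

```python
def ai_scores(images_processed, total_flops, wall_time_s, top1_accuracy):
    """A plausible reading of the two AI scores (assumption, not the paper's
    exact definition): throughput weighted by how many results are 'valid',
    i.e. correctly classified."""
    vips = images_processed * top1_accuracy / wall_time_s    # Valid Images Per Second
    vops = total_flops * top1_accuracy / wall_time_s         # Valid FLOPs Per Second
    return vips, vops

# e.g. 1000 images through a model costing ~4 GFLOPs each, 72% top-1 accuracy, 25 s
print(ai_scores(1000, 1000 * 4e9, 25.0, 0.72))
```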
- …